1 Today’s Goals

1.1 Running a line of code, clearing console, and clearing environment

  • Follow the arrows in the picture above to learn how to:
    • Run a line of code
    • Clear the console and the environment

2 Introduction to install.packages() and library()

# install.packages("tidyverse")
# install.packages("readxl")
# install.packages("writexl")
# install.packages("ggplot2")
# install.packages("gapminder")
# install.packages("scales")
library(tidyverse)
library(ggplot2)
library(readxl)
library(writexl)
library(gapminder)
library(scales)

3 Introduction to setwd() and getwd()
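
As a minimal sketch, base R spells these two functions getwd() and setwd(). The example below uses tempdir() as a stand-in for a project folder (an assumption for illustration; substitute your own path) and restores the original directory at the end.

```r
# Remember where R is currently looking for files
original_dir <- getwd()
original_dir

# Point R at another folder; tempdir() is only a stand-in for your project folder
setwd(tempdir())
getwd()

# Switch back so the rest of the session is unaffected
setwd(original_dir)
```

Files referred to by name alone, such as "gapminder.xlsx", are looked up in the working directory.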

4 Introduction to R script, console, and environment

5 Data in Excel

6 Importing the Gapminder Excel file

gapminder_tbl <- read_xlsx("gapminder.xlsx")
gapminder_tbl

7 Some Mathematical Operations

7.1 Addition

3+4
## [1] 7

7.2 Subtraction

5-3
## [1] 2

7.3 Multiplication

3*5
## [1] 15

7.4 Division

3/4
## [1] 0.75

7.5 Raised to the power

3^4
## [1] 81

7.6 Assigning value to a variable “a”

a <- 3+4

a
## [1] 7

7.7 Overwriting a variable

a <- 95/7

a 
## [1] 13.57143

8 Data Types

# numeric
class(7)
## [1] "numeric"
class(7.2)
## [1] "numeric"
# character
class("abcd")
## [1] "character"
# factor
class(as.factor("High"))
## [1] "factor"
# logical

class(TRUE)
## [1] "logical"

9 Manipulating Dataset

9.1 What is a pipe operator?

  • The pipe operator is written %>%. Its keyboard shortcut in RStudio is Shift + Cmd + M (Mac) or Shift + Ctrl + M (Windows/Linux).
  • The pipe operator passes the output of one function as the input to the next, so a sequence of steps reads from top to bottom.
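
To make the idea concrete, here is a small sketch that computes the same result twice, once with nested calls and once with the pipe. The vector x is made up for illustration.

```r
library(dplyr)  # attaches %>% (also loaded by library(tidyverse))

x <- c(1, 4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

nested
piped
```

Both versions give the same number; the piped one simply lists the steps in the order they happen.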

9.2 view()

  • view() allows us to take a look at the whole dataset.
gapminder_tbl %>% 
  view()

9.3 glimpse()

gapminder %>% 
  glimpse()
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
  • glimpse() gives a quick look at the structure of our dataset: the number of rows and columns, the type of each variable, and its first few values.

9.5 tail()

gapminder_tbl %>% 
  tail()
  • tail() returns the last six observations from our dataset.

9.6 select() and why we use it

  • Imagine we are working on a hypothetical dataset with 150 columns, of which we need at most 5. This is when select() comes in handy.

  • In our dataset, let’s say we only want country, continent, year, and population as our columns.

gapminder_tbl %>% 
  select(country, continent, year, pop)

9.6.1 Using everything() to select the rest of the columns

  • Now, let’s say we want the continent column at the very beginning, followed by the remaining columns.
  • In the code chunk below, everything() selects the rest of the columns so that we do not have to type them out manually.
gapminder_tbl %>% 
  select(continent, everything())
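
The same two ideas can be checked on a tiny table. This is only a sketch: toy_tbl and its values are made up for illustration and are not part of the Gapminder data.

```r
library(dplyr)

# A small made-up table standing in for a wide dataset
toy_tbl <- tibble(
  country   = c("A", "B"),
  continent = c("X", "Y"),
  year      = c(1952, 1952),
  pop       = c(100, 200)
)

# Keep only the columns we name
kept <- toy_tbl %>%
  select(country, pop)

# Move continent to the front; everything() fills in the rest
reordered <- toy_tbl %>%
  select(continent, everything())

names(kept)
names(reordered)
```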

9.7 filter()

  • filter() keeps only the rows (observations) that satisfy a condition.
  • One of our main goals today is to create a dataset with observations coming from France only.
  • The double equals sign == is the equality operator: it tests whether two values are equal. Its counterpart != tests whether they are not equal.
  • A single equals sign = does not compare values, and using it inside filter() produces an error, so check for this common mistake.
  • Here’s how we do it:
gapminder_france_tbl <- gapminder_tbl %>% 
  filter(country == "France")

gapminder_france_tbl
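
To see what filter() does with a condition, it helps to look at the condition on its own. The sketch below uses a made-up table (toy_tbl is hypothetical) so that the results are easy to check by eye.

```r
library(dplyr)

toy_tbl <- tibble(
  country = c("France", "Japan", "France"),
  year    = c(1952, 1952, 1957)
)

# == compares row by row and returns TRUE or FALSE for each row
toy_tbl$country == "France"

# filter() keeps the rows where the condition is TRUE
france_only <- toy_tbl %>%
  filter(country == "France")

# != keeps the rows where the values differ
not_france <- toy_tbl %>%
  filter(country != "France")

# Conditions separated by a comma must all be TRUE
france_1952 <- toy_tbl %>%
  filter(country == "France", year == 1952)
```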

9.7.1 Let’s work on some examples:

  • Let’s say we only want observations from Asia
  • Let’s say we only want observations from the year 1952
  • Let’s say we don’t want observations from Europe
# gapminder_tbl %>% 
#   filter(continent == "Asia")
# 
# gapminder_tbl %>% 
#   filter(year == 1952)
# 
# gapminder_tbl %>% 
#   filter(continent != "Europe")

9.8 count()

  • count() counts how many rows there are for each unique value of one or more variables.
  • Let’s say we want to know how many times each continent appears in our dataset.
  • sort = TRUE sorts the result so that the largest counts come first.
gapminder_tbl %>% 
    count(continent, sort = TRUE) 
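
As a sketch of what count() is doing, the example below counts a made-up column and then reproduces the same result with group_by() and summarise(). The table toy_tbl is hypothetical.

```r
library(dplyr)

toy_tbl <- tibble(
  continent = c("Asia", "Europe", "Asia", "Africa", "Asia", "Europe")
)

# One row per continent, largest count first
counts <- toy_tbl %>%
  count(continent, sort = TRUE)

counts

# The same result written out step by step
counts_long <- toy_tbl %>%
  group_by(continent) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))

counts_long
```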

9.9 mutate()

  • mutate() allows us to create new columns or modify the existing columns.
  • Let’s say we want to multiply the population of every country by 10.
  • The code chunk below demonstrates an example of creating a new column using mutate().
gapminder_tbl %>% 
  mutate(pop_increased_10_times = pop * 10)
# gapminder_tbl %>% 
#   mutate(pop_increased_by_10 = pop + 10)
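
mutate() can also modify an existing column: if the name on the left-hand side already exists, that column is overwritten. A minimal sketch on a made-up table (toy_tbl and pop_millions are hypothetical names):

```r
library(dplyr)

toy_tbl <- tibble(
  country = c("A", "B"),
  pop     = c(1000000, 2500000)
)

# New name: a column is added and pop is left untouched
with_new <- toy_tbl %>%
  mutate(pop_millions = pop / 1e6)

# Existing name: the pop column is replaced
overwritten <- toy_tbl %>%
  mutate(pop = pop / 1e6)

with_new
overwritten
```

Neither call changes toy_tbl itself; the result has to be assigned to a name to be kept.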

9.10 Converting a character variable into a categorical (factor) variable and vice versa

gapminder_tbl %>% 
  mutate(continent = as.factor(continent)) %>% 
  glimpse()
## Rows: 1,704
## Columns: 6
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
gapminder_tbl %>% 
  mutate(continent = as.factor(continent)) %>% 
  mutate(continent = as.character(continent)) %>% 
  glimpse()
## Rows: 1,704
## Columns: 6
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
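
The same round trip can be tried on a plain vector. This sketch uses a made-up vector to show that a factor stores a fixed set of levels, and that converting back with as.character() recovers the original text.

```r
continent_chr <- c("Asia", "Europe", "Asia", "Africa")

# Character to factor
continent_fct <- as.factor(continent_chr)
class(continent_fct)

# A factor keeps a set of levels; as.factor() sorts them alphabetically
levels(continent_fct)

# Factor back to character
continent_back <- as.character(continent_fct)
class(continent_back)
```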

9.11 arrange()

  • arrange() sorts the rows of a dataset by one or more columns, in ascending order or, with desc(variable_name), in descending order.
  • Ascending order is the default.
gapminder_afg_asc <- gapminder_tbl %>% 
    filter(country == "Afghanistan") %>% 
    arrange(pop)

gapminder_afg_asc
gapminder_afg_desc <- gapminder_tbl %>% 
    filter(country == "Afghanistan") %>%
    arrange(desc(pop))

gapminder_afg_desc

9.12 Creating bins using ntile()

  • ntile() ranks the values in a column and splits them into however many bins we ask for, with roughly the same number of rows in each bin, so we do not have to choose the cut-points ourselves.
gapminder_tbl %>% 
    mutate(gdpPercap_bin = ntile(gdpPercap, 3))
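
A short sketch on a made-up vector shows how the bins are assigned: the two smallest values land in bin 1, the next two in bin 2, and the two largest in bin 3.

```r
library(dplyr)

values <- c(50, 10, 40, 20, 60, 30)

# Split six values into three equally sized bins by rank
bins <- ntile(values, 3)

bins
table(bins)
```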

9.13 quantile() and case_when()

gapminder_tbl %>% 
  mutate(
    gdpPercap_bin2 = case_when(
      gdpPercap > quantile(gdpPercap, 0.66) ~ "High",
      gdpPercap > quantile(gdpPercap, 0.33) ~ "Medium",
      TRUE ~ "Low"
    )
  ) 
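
In the chunk above, quantile() supplies the cut-points and case_when() attaches the labels. case_when() checks its conditions from top to bottom and uses the first one that is TRUE, so the order of the lines matters. The sketch below uses fixed, made-up thresholds (6 and 3) instead of quantiles to keep the result easy to verify.

```r
library(dplyr)

scores <- c(1, 5, 9)

# Strictest condition first: each value gets the first label that matches
labels <- case_when(
  scores > 6 ~ "High",
  scores > 3 ~ "Medium",
  TRUE       ~ "Low"
)

# Swapping the first two lines hides "High", because 9 already matches scores > 3
swapped <- case_when(
  scores > 3 ~ "Medium",
  scores > 6 ~ "High",
  TRUE       ~ "Low"
)

labels
swapped
```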

9.14 group_by() and summarise()

  • group_by() and summarise() usually go hand in hand.
  • group_by() converts an existing table into a grouped table, and summarise() then computes one summary row per group.
  • After using group_by() and summarise(), make sure to ungroup().
  • Let’s say we want to know the total population of each continent in the year 1952.
gapminder_tbl %>% 
  filter(year == 1952) %>% 
  group_by(continent) %>% 
  summarise(population = sum(pop)) %>% 
  ungroup() %>% 
  arrange(desc(population))
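
The same pattern on a made-up table, small enough to add up by hand (toy_tbl and its values are hypothetical):

```r
library(dplyr)

toy_tbl <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  pop       = c(10, 20, 1, 2, 3)
)

# One row per continent, holding the sum of pop within that continent
totals <- toy_tbl %>%
  group_by(continent) %>%
  summarise(population = sum(pop)) %>%
  ungroup() %>%
  arrange(desc(population))

totals
```

Five input rows collapse into two output rows, one for each group.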

10 Visualization

10.1 Unformatted Visualization

gapminder_tbl %>% 
    filter(year == 1952) %>% 
    group_by(continent) %>% 
    summarise(total_population = sum(pop)) %>% 
    ungroup() %>% 
    # arrange(desc(total_population)) %>% 
    mutate(continent = as.factor(continent)) %>% 
    # Visualize
    
    ggplot(aes(continent, total_population))+
    geom_col(fill = "#2c3e50", width = 0.5)+
    
    scale_y_continuous(labels = scales::comma)+
    theme_minimal()+
    
    labs(title = "Population of Different Continents in 1952",
         x = "",
         y = "Population",
         subtitle = "",
         caption = "Data Source: Gapminder")

10.2 Formatted Visualization

gapminder_tbl %>% 
    filter(year == 1952) %>% 
    group_by(continent) %>% 
    summarise(total_population = sum(pop)) %>% 
    ungroup() %>% 
    arrange(desc(total_population)) %>% 
    mutate(continent = as_factor(continent)) %>%
    # Visualize

    ggplot(aes(continent, total_population))+
    geom_col(fill = "#2c3e50", width = 0.5)+

    scale_y_continuous(labels = scales::comma)+
    theme_minimal()+

    labs(title = "Population of Different Continents in 1952",
         x = "",
         y = "Population",
         subtitle = "",
         caption = "Data Source: Gapminder")
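
The two chunks differ in two lines: the second one sorts the summary with arrange(desc(total_population)) and converts continent with as_factor() (from forcats, which tidyverse loads) instead of base R's as.factor(). ggplot2 draws the bars in the order of the factor levels, so this choice decides the bar order. A minimal sketch on a made-up vector, assumed to be already sorted from largest to smallest:

```r
library(forcats)

# Pretend these are already sorted by population, largest first
continents_sorted <- c("Asia", "Europe", "Africa")

# as.factor() sorts the levels alphabetically and loses the sorted order
levels_base <- levels(as.factor(continents_sorted))

# as_factor() keeps the levels in order of appearance
levels_forcats <- levels(as_factor(continents_sorted))

levels_base
levels_forcats
```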

11 Saving gapminder_france_tbl as an Excel file

# writexl::write_xlsx(gapminder_france_tbl, path = "gapminder_france.xlsx")
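
One way to check that a file was saved correctly is to read it straight back in. The sketch below does this with a made-up data frame and a temporary file; toy_df and the file name toy_example.xlsx are hypothetical, chosen so that nothing in your project folder is overwritten.

```r
library(writexl)
library(readxl)

toy_df <- data.frame(
  country = c("France", "France"),
  year    = c(1952, 1957)
)

# Write to a temporary location rather than the working directory
out_path <- file.path(tempdir(), "toy_example.xlsx")
write_xlsx(toy_df, path = out_path)

# Read the file back to confirm its contents
read_back <- read_xlsx(out_path)
read_back
```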

12 Correlation Plot

12.1 Plot 1

# install.packages("corrplot")
library(corrplot)

gapminder_tbl %>% 
    select(year:gdpPercap) %>% 
    cor() %>% 
    corrplot(method = "number")

12.2 Plot 2

gapminder_tbl %>%
    select(year:gdpPercap) %>%
    cor() %>%
    corrplot(method = "color", order = "alphabet")
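
Both plots are drawn from the matrix that cor() returns, and cor() needs numeric columns, which is why the chunks first select year through gdpPercap. As a sketch of what that matrix contains, the example below uses a made-up data frame with known relationships: doubled rises exactly with x, and negated falls exactly with x.

```r
toy_df <- data.frame(x = c(1, 2, 3, 4, 5))
toy_df$doubled <- toy_df$x * 2
toy_df$negated <- -toy_df$x

# Pairwise correlations between every pair of columns
cor_matrix <- cor(toy_df)
round(cor_matrix, 2)

# This matrix is what corrplot() would draw, for example:
# corrplot::corrplot(cor_matrix, method = "number")
```

Values near 1 mean two columns rise together, values near -1 mean one rises as the other falls, and the diagonal is always 1.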